[MUSIC]
Text data is very special.
In contrast to the data captured
by machines such as sensors,
text data is produced by humans.
And it is also meant
to be consumed by humans.
And this has some
interesting consequences.
Because it is produced by humans, it tends
to have a lot of useful knowledge about
people's preferences,
people's opinions about everything.
And that makes it possible to mine
text data to discover those
latent preferences of people,
which could be very useful to build
an intelligent system to help people.
You can also think about
scientific literature:
it's a way to encode
our knowledge about the world.
So it's very high quality content, yet we
have difficulty digesting all the content.
Now as a result of the fact that
text is consumed by us humans,
we also need intelligent software tools
to help people digest the content, or
otherwise we'd miss
a lot of useful content.
This slide shows that humans really
play an important role in text data mining.
We have to consider humans in the loop, and
we have to consider the fact that
the text is generated by humans.
So, here are some examples of
useful text information systems.
This is by no means a complete
list of all applications.
I categorize them into
different categories.
But you can probably imagine
other kinds of applications.
So let's take a look at some of them.
Search, for example:
we all know search engines, especially
web search engines. I bet
all of you are using Google, or Bing, or
another web search engine all the time.
And we also have library search systems.
And in fact, wherever you have a lot of
text data, you would have a search engine.
So for example, you might have
a search box on your laptop.
All right,
to search content on your computer.
So that's one kind of application system,
but
we also have filtering systems or
recommender systems.
Those systems can push
information to users.
They can recommend useful
information to users.
So again: news filters, spam filters,
literature or movie recommenders.
Now, not all of them are necessarily
recommending information to you.
For example, an email filter,
a spam email filter,
is actually meant to filter out
the spam from your inbox, all right.
But in nature these are similar systems in
that they have to make a binary decision
regarding whether to retain
a particular document or discard it.
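
To make the binary decision concrete, here is a minimal sketch in Python. It is only an illustration, not a method from this course: the keyword list, the threshold, and the function name keep_document are all assumptions made up for this example, and a real filter would learn its rule from data.

```python
# Hypothetical keyword-based filter. The keyword set and the threshold
# are illustrative assumptions, not values taken from the lecture.
SPAM_KEYWORDS = {"winner", "free", "prize", "urgent"}


def keep_document(text: str, max_spam_hits: int = 1) -> bool:
    """Binary decision: True means retain the document, False means discard it."""
    # Lowercase each word and strip common punctuation before matching.
    words = {w.strip(".,!?:;").lower() for w in text.split()}
    return len(words & SPAM_KEYWORDS) <= max_spam_hits


inbox = [
    "Meeting moved to 3pm tomorrow",
    "URGENT: you are a winner, claim your free prize!",
]

# Only the documents the filter decides to retain stay in the inbox.
retained = [msg for msg in inbox if keep_document(msg)]
print(retained)
```

A recommender makes the same kind of yes-or-no decision, only with a rule that tries to keep what the user would like instead of discarding what the user would not.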
Another kind of system
is the categorization system.
So for example, in handling emails,
you might prefer an automatic
sorter that would automatically
sort incoming emails into the proper
folders that you created.
Or we might want to categorize product
reviews into positive or negative.
News agencies might be interested in
categorizing news articles into
all kinds of subject categories.
Those are all categorization systems.
Finally there are also systems
that might do more analysis.
Or, you could say, mine text data.
And these can be text mining systems or
information extraction systems,
and they can be
used to analyze text data in more detail
to discover potentially useful knowledge.
For example companies might
be interested in discovering
major complaints from their customers
based on the email messages that
they have received from the customers.
All right, so
having a system to support that would
really help improve their productivity and
the customer relations.
Also, in business intelligence, companies
are often interested in analyzing product
reviews to understand the relative
strengths of their own products
in comparison with competitors.
And so these are all examples
of these text mining systems.
[INAUDIBLE] we have a lot of data
in particular literature data.
So, there's also a great opportunity
for using computer systems
to analyze the data to
automatically read literature, and
to gain knowledge, and
to help biologists make discoveries.
And you can imagine many others.
So the point is that with so
much text data,
we can build very useful systems to
help people in many different ways.
Now, how do we build these systems?
Well these actually are the main
technologies that we'll be talking
about in this course and the other course
that I'm teaching for this specialization.
The main techniques for
building these systems and also for
harnessing the text data are text
retrieval and text data mining.
So I use this picture to show
the relation between these two
somewhat different techniques.
We start with big text data, right?
But for any application, we don't
necessarily need to use all the data.
Often we only need a small subset of the
most relevant data, and that's shown here.
So text retrieval is to convert big,
raw text data into that small
subset of most relevant data that are most
useful for a particular application.
And this is usually
done by search engines.
And so
this will be covered in this course.
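
As a rough illustration of turning a big collection into a small relevant subset, here is a minimal Python sketch. It assumes a toy scoring rule (counting how often the query words occur in a document); the function names score and retrieve and the sample collection are made up for this example and are not the retrieval methods covered later in the course.

```python
from collections import Counter


def score(query: str, document: str) -> int:
    """Toy relevance score: total occurrences of the query's words in the document."""
    doc_counts = Counter(document.lower().split())
    return sum(doc_counts[term] for term in set(query.lower().split()))


def retrieve(query: str, collection: list[str], k: int = 2) -> list[str]:
    """Return up to k documents with a non-zero score, best first."""
    scored = [(score(query, doc), doc) for doc in collection]
    matching = [pair for pair in scored if pair[0] > 0]
    # Sort by score only; Python's sort is stable, so ties keep collection order.
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in matching[:k]]


# A tiny stand-in for "big text data".
collection = [
    "search engines rank documents for a query",
    "recommender systems push information to users",
    "a web search engine indexes web pages",
    "sensors capture temperature data",
]

results = retrieve("web search", collection)
print(results)
```

The point of the sketch is only the shape of the step: a query goes in, the whole collection is scored, and a short ranked list comes out.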
After we have got a small
amount of relevant data,
we also need to further analyze the data
to help people digest the data, or
to turn the data into
actionable knowledge.
And this step is called text mining,
where we use a number of techniques to
mine the data to get useful knowledge or
patterns.
And the knowledge can then be used
in many different applications.
And this part, text mining, will be
covered in the other course that I'm
teaching called Text Mining and Analytics.
The emphasis of this course
is on basic concepts and
practical techniques in text retrieval.
More specifically we will
cover how search engines work.
How to implement a search engine.
How to evaluate a search engine, so
that you know whether one search engine is
better than another or
one method is better than another.
How to improve and
optimize a search engine system.
And how to build a recommender system.
We also hope to provide hands-on
experience on multiple aspects.
One is to create a test collection for
evaluating search engines.
This is very important for knowing
which technique actually works well.
And whether your search engine system
is really good for your application.
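
To give a feel for what a test collection makes possible, here is a minimal Python sketch. It assumes one particular measure, precision at k (the fraction of the top k results that are judged relevant), and the document IDs, relevance judgments, and rankings below are invented purely for illustration.

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Hypothetical test collection entry: one query with human relevance judgments.
relevant = {"d1", "d4", "d7"}

# Hypothetical rankings returned by two search engines for that query.
engine_a = ["d1", "d2", "d4", "d9", "d7"]
engine_b = ["d2", "d3", "d1", "d9", "d8"]

p_a = precision_at_k(engine_a, relevant, 5)
p_b = precision_at_k(engine_b, relevant, 5)
print(p_a, p_b)
```

With the same query and the same judgments, the two numbers can be compared directly, which is exactly what a shared test collection is for.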
The other aspect is to experiment
with search engine algorithms.
In practice, you will have to face
choices of different algorithms.
So it's important to know
how to compare them and
to figure out how they work or
maybe potentially, how to improve them.
And finally, we'll provide a platform for
you to do a search engine competition,
where you can compare your different
ideas to see which idea works better
on some data set.
The prerequisites for
this course are minimal.
Basically, we hope you know some basic
concepts of computer science, for
example data structures.
And we hope you will be comfortable
with programming, especially in C++,
because that's the language that we'll use
for some of the programming assignments.
The format is lectures plus quizzes,
as often happens in MOOCs.
And we will also provide
programming assignments for
those of you that have
the resources to do that.
We don't really have any required
readings for this course.
That just means that if you follow all
the lecture videos carefully,
you're supposed to know all the basic
concepts and the basic techniques.
But it's always useful to read more, so
here we provide a list of
some useful reference books.
And this is in time order, and
it also includes a book that
[INAUDIBLE] and I are co-authoring now, and
we make some draft chapters
available on this website.
And you can find more readings and
reference books on this website.
Finally, this is the course schedule.
That's just the topic map for
the rest of the course,
and it shows the topics that we will
cover in the remaining lectures.
This picture also shows the basic flow of
information in a text information system.
So starting from the big text data, the
first step is to do some natural language
content analysis, because text data is
in the form of natural language text.
So we need to understand
the text to some extent
in order to do something useful for
the users.
So this is the first
topic that we will cover.
And then on top of that as you
can see there are two boxes here.
Those are two types of systems
that can be used to help people
get access to the most relevant data.
Or in other words, those are the two
kinds of systems that will convert
big text data into small
relevant text data.
Search engines are helping
users to search or
to query the data to get
the most relevant documents out.
Recommender systems are to
recommend information to users,
to push information to users.
So those are two complementary ways of
getting users connected to the most
relevant data at the right time.
So this part is called text access,
and this will be the next topic.
And after we cover that we are going
to cover a number of topics,
all about search engines.
Now the text access
topic is a brief topic,
a brief coverage of
the two kinds of systems.
In the remaining topics, we'll cover
search engines in much more detail.
That includes the text retrieval problem,
text retrieval methods, how to evaluate
these methods, implementation of
the system, and web search applications.
And after these, we're going to
cover recommender systems.
So this is what you can expect
in the rest of this course.
Thanks.
[MUSIC]

